Design of Speech Corpus for Open Domain Urdu Text to Speech System Using Greedy Algorithm

Authors

  • Wajiha Habib
  • Rida Hijab Basit
  • Sarmad Hussain
  • Farah Adeeba

Abstract

Unit selection speech synthesis is one of the most widely used techniques for high-quality text-to-speech (TTS) systems. A unit selection TTS system requires a large database of recorded and annotated speech that covers both phonetic and prosodic variation. Designing a phonetically rich and balanced speech corpus with a minimum number of utterances is an intricate task. Several optimization methods are used for this purpose, and the greedy algorithm is one of them. This paper introduces a greedy algorithm that maximizes the coverage of high-frequency unigrams, bigrams and trigrams while selecting a minimal number of sentences from the input corpus. The algorithm has been applied to corpora collected from different domains, and a speech corpus for an Urdu TTS system has been designed. Significant triphone coverage has also been achieved.
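
The greedy selection idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`ngrams`, `units`, `greedy_select`), the frequency-weighted score, and the `top_k` and `max_sentences` parameters are all assumptions made for illustration. The sketch assumes each sentence is already transcribed into a list of phone symbols, treats the most frequent unigrams, bigrams and trigrams in the input corpus as the targets, and repeatedly picks the sentence that adds the most still-uncovered target frequency.

```python
from collections import Counter


def ngrams(phones, n):
    """Return the n-grams (as tuples) found in one phone sequence."""
    return [tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)]


def units(phones, orders=(1, 2, 3)):
    """Distinct unigrams, bigrams and trigrams in one sentence."""
    return {g for n in orders for g in ngrams(phones, n)}


def greedy_select(sentences, top_k=None, max_sentences=None):
    """Greedily pick sentences covering the most frequent uncovered units.

    sentences: list of phone sequences (lists of strings); hypothetical input.
    top_k: keep only the top_k most frequent units as targets (None = all).
    Returns (indices of chosen sentences, fraction of targets covered).
    """
    # Corpus-wide frequency of every unigram, bigram and trigram.
    freq = Counter(g for s in sentences for n in (1, 2, 3) for g in ngrams(s, n))
    targets = {g for g, _ in freq.most_common(top_k)}
    sentence_units = [units(s) & targets for s in sentences]

    uncovered = set(targets)
    chosen = []
    while uncovered and (max_sentences is None or len(chosen) < max_sentences):
        # Score = total corpus frequency of the uncovered units a sentence adds.
        best, best_score = None, 0
        for i, u in enumerate(sentence_units):
            score = sum(freq[g] for g in u & uncovered)
            if score > best_score:
                best, best_score = i, score
        if best is None:  # no remaining sentence adds anything new
            break
        chosen.append(best)
        uncovered -= sentence_units[best]

    coverage = 1 - len(uncovered) / len(targets) if targets else 1.0
    return chosen, coverage


if __name__ == "__main__":
    # Toy corpus with made-up phone symbols, for illustration only.
    corpus = [["a", "b", "c"], ["a", "b"], ["c", "d"], ["d"]]
    print(greedy_select(corpus))
```

Scoring by frequency rather than by a plain count of new units is one possible design choice; it biases the selection toward high-frequency units first, which matches the stated goal only loosely and would need tuning against a real corpus.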

Similar resources

Speech Corpus Development for a Speaker Independent Spontaneous Urdu Speech Recognition System

This paper reports the design and development of an 82 speaker Urdu speech corpus for speaker independent spontaneous speech recognition using the CMU Sphinx Open Source Toolkit for Speech Recognition. The corpus consists of 45 hours of spontaneous and read speech data from 82 speakers (42 male and 40 female), recorded over a microphone and a telephone line. The speech was collected from speake...

Hidden Markov Model (HMM) based Speech Synthesis for Urdu Language

This paper describes the development of an HMM-based speech synthesizer for the Urdu language using the HTS toolkit. It describes the modifications needed to port the original HTS-Demo-scripts, which are currently available for English, Japanese and Portuguese, to Urdu. That includes the generation of the full-context style labels and the creation of the question file for the Urdu phone set. F...

Corpus Creation for Polish Unit Selection Speech Synthesis

This paper describes the process of creating a speech corpus for Polish unit selection speech synthesis. This task is time-consuming, and manually designing the corpus is, in practice, only applicable in limited-domain speech synthesis and recognition. The sentence selection tools used while designing the corpus are usually based on the greedy algorithm. The algorithm looks for sentences which cov...

Slovenian Text-to-Speech Synthesis for Speech User Interfaces

The paper presents the design concept of a unit-selection text-to-speech synthesis system for the Slovenian language. Due to its modular and upgradable architecture, the system can be used in a variety of speech user interface applications, ranging from server carrier-grade voice portal applications, desktop user interfaces to specialized embedded devices. Since memory and processing power requi...

Urdu and Hindi: Translation and sharing of linguistic resources

Hindi and Urdu share a common phonology, morphology and grammar but are written in different scripts. In addition, the vocabularies have diverged significantly, especially in the written form. In this paper we show that we can get reasonable quality translations (we estimated the Translation Error Rate at 18%) between the two languages even in the absence of a parallel corpus. Linguistic resour...

Publication date: 2014